ABSTRACT


Online educational videos have emerged as one of the most pop- 
ular modes of learning in the recent years. Studies have shown 
that liveliness is highly correlated to engagement in educational 
videos. Previous work has estimated liveliness through feature
engineering on acoustic information alone. In this paper, we propose
a technique called LIVELINET
that combines audio and visual information to predict liveliness.
First, a convolutional neural network is used to predict the visual 
setup, which in turn identifies the modalities (visual and/or audio) 
to be used for liveliness prediction. Second, we propose a novel 
method that uses multimodal deep recurrent neural networks to au- 
tomatically estimate if an educational video is lively or not. On the 
StyleX dataset of 450 one-minute long educational video snippets, 
our approach shows a relative improvement of 7.6% and 1.9%
compared to a multimodal baseline and a deep network baseline 
using only the audio information respectively. 


Keywords 
Liveliness, Educational Videos, Recurrent Neural Network, Deep 
Learning, LSTM, Engagement, Multimodal Analysis. 


1. INTRODUCTION 


The amount of freely available online educational videos has grown 
significantly over the last decade. Several recent studies [1, 2, 3] 
have demonstrated that when educational videos are not engag- 
ing, students tend to lose interest in the course content. This has 
led to recent research activity in speaking style analysis of educa- 
tional videos. Authors in [4] used crowd-sourced descriptors of 
100 video clips to identify various speaking-style dimensions such 
as liveliness, speaking rate, clarity, formality etc. that drive stu- 
dent engagement and demonstrated that liveliness plays the most 
significant role in video engagement. Using a set of acoustic fea- 
tures and LASSO regression, the authors also developed automatic 
methods to predict liveliness and speaking rate. The authors in [5]
analyze the prosodic variables in a corpus of eighteen oral
presentations made by students of Technical English, all of whom were
native speakers of Swedish. They found that high pitch variation
in speech is highly correlated with liveliness. Arsikere et
al. [6] built a large scale educational video corpus called StyleX 
for engagement analysis and provided initial insights into the effect 
of various speaking-style dimensions on learner engagement. They 
also found out that liveliness is the most influential dimension in 
making a video engaging. In this paper, we propose a novel mul- 
timodal approach called LIVELINET that uses deep convolutional 
neural networks and deep recurrent neural networks to automati- 
cally identify if an educational video is lively or not. 


A learner can typically perceive or judge the liveliness¹ of an
educational video through both the visual and the auditory senses.
A lecturer usually makes a video lively by using several visual 
actions such as hand movement, interactions with other objects 
(board/tablet/slides) and audio actions such as modulating voice in- 
tensity, varying speaking rate etc. In the proposed approach, both 
visual and audio information from an educational video are com- 
bined to automatically predict the liveliness of the video. Note that 
a given lecture can also be perceived as lively based on the con- 
textual information (e.g., a historic anecdote) that the lecturer may 
intersperse within the technical content. We do not, however, address
this dimension of liveliness in this work².


This paper is novel in three important aspects. First, the proposed 
approach is the first of its kind that combines audio and visual infor- 
mation to predict the liveliness in a video. Second, a convolutional 
neural network (CNN) is used to estimate the setup (e.g., lecturer 
sitting, standing, writing on a board etc.) of a video. Third, Long 
Short Term Memory (LSTM) based recurrent neural networks are 
trained to classify the liveliness of a video based on audio and 
visual features. The CNN output determines which of the audio 
and/or visual LSTM output should be combined for the liveliness 
prediction. 


We observe that there is a lot of variation in what is being displayed 
in an educational video, e.g., slide/board, lecturer, both slide/board 
and lecturer, multiple video streams showing lecturer and slide, etc.
These different visual setups usually indicate to what degree the 
audio and the visual information should be combined for predict- 
ing liveliness. For example, when the video feed only displays the 
slide or the board, the visual features do not play a critical role 
in determining liveliness. However, when the video is focussed on
the lecturer, the hand gestures, body postures, body movements etc.
become critical, i.e., the visual component plays a significant role
in making a video lively. Hence, we first identify the setup of a
video using a CNN based classifier. Next, depending on the setup,
we either use both audio and visual information or use only the
audio information from a video for training/testing of the LSTM
networks. We train two separate LSTM based classifiers, one each
for audio and visual modalities, which take a temporal sequence
of audio/visual features from a video clip as input and predict if
the clip is lively or not. Finally, audio/visual features from a test
video clip are forward-propagated through these LSTMs and their
outputs are combined to obtain the final liveliness label.

¹ Defined as “full of life and energy/active/animated” in the dictionary.
² Note that the human labelers who provided the ground truth for
our database [6] were explicitly asked to ignore this aspect while
rating the videos.


We perform experiments on the StyleX dataset [6], and compare 
our approach with baselines that are based on visual, audio and 
combined audio-visual features. The proposed approach shows a
relative improvement of 7.6% and 1.9% with respect to a multimodal
baseline and a deep network baseline using only the audio modality 
respectively. 


2. RELATED WORK 


In this section, we discuss the relevant prior art in deep learning 
and multimodal public speaking analysis in videos. 


Deep Learning: Recently, deep neural networks have been extensively
used in computer vision, natural language processing and
speech processing. LSTM [7], a Recurrent Neural Network (RNN) [8]
architecture, has been extremely successful in temporal modelling
and classification tasks such as handwriting recognition [9], action
recognition [10], image and video captioning [11, 12, 13], speech
recognition [14, 15] and machine translation [16]. CNNs have also
been successfully used in many practical computer vision tasks
such as image classification [17], action recognition [18], object
detection [19, 20], semantic segmentation [21] and object tracking [22].
In this work, we use CNNs for visual setup classification and
LSTMs for the temporal modelling of audio/visual features.


Multimodal Public Speaking Analysis: Due to the recent development
of advanced sensor technologies, there has been significant
progress in the analysis of public speaking scenarios. The
proposed methods usually employ multiple sensing modalities such
as microphones, RGB cameras, depth sensors, Kinect sensors, Google
Glass, body wearables, etc., and analyse the vocal behaviour, body
language, attention, eye contact and facial expression of the speakers
along with the engagement of the audiences [23, 24, 25, 26]. Gan
et al. [23] proposed baseline methods to quantify several of the
above-mentioned parameters by analysing the multi-sensor
data. Nguyen et al. [24] and Echeverria et al. [25] used Kinect
sensors to recognize the bodily expressions, body posture and eye
contact of the speaker, and thereby provide feedback to the speaker.
Chen et al. [26] presented an automatic scoring model that uses basic
features for the assessment of public speaking skills. It must be
noted that all these works rely significantly on the sensor data
captured during the presentation for their prediction task and hence
are not applicable to educational videos that are available online.
Moreover, all these approaches use shallow, hand-crafted
audio features along with the sensor data. In contrast, our proposed
method uses deep-learning-based automatic feature extraction
for both the audio and visual modalities of the video,
and predicts the liveliness.


To the best of the authors’ knowledge, this is the first work that
uses a deep multimodal approach for educational video analysis.


3. PROPOSED APPROACH


In this section, we describe the details of the proposed approach. 
We begin with the description of how a given video is modeled as 
a sequence of temporal events, followed by the visual setup clas- 
sification algorithm. Next, we provide the details of the audio and 
visual feature extraction. Finally, the proposed multimodal
method for liveliness prediction is described. The pipeline
of the proposed approach is shown in Figure 1. The input to the
system is a fixed-length video segment of 10 seconds during both
training and testing (referred to as 10-second clips throughout the
paper). For any educational video of arbitrary length, 10-second
clips are extracted with 50% overlap between adjacent clips,
and the overall video liveliness label is determined by
majority voting over the clips. Section 5.1 provides further details
regarding the extraction of these 10-second clips from the StyleX dataset.
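
The clip extraction and voting step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, fixed-stride windows are assumed (the actual clips are aligned to silences, see Section 5.1), and the tie-breaking rule is an assumption because the text does not specify one.

```python
from collections import Counter


def clip_windows(duration_s, clip_len_s=10.0, overlap=0.5):
    """Return (start, end) times of fixed-length clips with the given overlap."""
    stride = clip_len_s * (1.0 - overlap)  # 5 s stride for 50% overlap
    windows = []
    start = 0.0
    while start + clip_len_s <= duration_s:
        windows.append((start, start + clip_len_s))
        start += stride
    return windows


def video_label(clip_labels):
    """Majority vote over per-clip binary labels (1 = lively, 0 = not lively)."""
    counts = Counter(clip_labels)
    # Assumption: ties go to "not lively"; the source does not state a tie rule.
    return 1 if counts[1] > counts[0] else 0
```

For a 30-second video this yields five overlapping clips starting at 0, 5, 10, 15 and 20 seconds, and the video is labelled lively only if more than half of its clips are.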


3.1 Video Temporal Sequencing 

Each 10-second clip is modeled as a temporal sequence of smaller
chunks. If the total number of chunks in a 10-second clip is $T$,
then $\{v_1, v_2, \ldots, v_t, \ldots, v_T\}$ and
$\{a_1, a_2, \ldots, a_t, \ldots, a_T\}$ represent
the temporal sequences of visual and audio features corresponding
to each 10-second clip respectively. Note that $v_t$ (Section 3.3) and
$a_t$ (Section 3.4) are the inputs to the visual and audio LSTMs at time
instant $t$.


3.2 Visual Setup Classification 

One of our objectives is to automatically determine if both audio 
and visual information are required for liveliness prediction. If a 
video displays only slide/board, the visual features are less likely 
to contribute to the liveliness. However, if the camera displays that 
the lecturer is in a sitting/standing posture or is interacting with 
the content, the visual features could significantly contribute to the 
video liveliness. Hence, we collect a training dataset and train a 
CNN to automatically estimate the setup of a video. We describe 
the definition of the labels, the data collection procedure and the 
details of the CNN training in the next three subsections. 


3.2.1 Video Setup Label Definition 


We define five different categories which cover almost all of the 
visual setups usually found in educational videos. 


• Content: This category includes the scenarios where the video
feed mainly displays the content, such as a blackboard, a slide
or a paper. Frames where the hand of the lecturer and/or pens or
pointers are also visible are included in this category. However,
the video clips belonging to this category should not include any
portion of the lecturer’s face. Since the lecturer is not visible
in this case, only the audio modality is used for liveliness
prediction.

• Person Walking/Standing: In this scenario, the content such
as a blackboard/slide is not visible. However, the lecturer walks
around or remains in a standing posture. The lecturer’s face and
upper body parts (hand/shoulder) should be visible. Both audio
and visual modalities are used to predict liveliness in this case.

• Person Sitting: The content is not visible and the camera
focuses only on the lecturer in a sitting posture. Both audio and
visual modalities are considered for liveliness prediction.

• Content & Person: This includes all the scenarios where both the
upper body of the lecturer and the content are visible. Frames
where the lecturer points to the slide/board or writes something
on the board are included in this category. Here also, both
modalities are used for liveliness prediction.

• Miscellaneous: This category includes all other scenarios which
are not covered by the above four categories, e.g., two different
video feeds for professor and content, students also visible, or
multiple people (laboratory setups) visible in the scene.
Since the frames from this category have significant intra-class
variation and noise, we use only the audio information for
liveliness prediction.

Figure 1: The overall pipeline of the proposed approach LIVELINET. The input to the system is a 10-second clip and the output is the liveliness
prediction label.


Some example frames from the above five categories are shown 
in Figure 2. The intra-class variation clearly shows the inherent 
difficulty of the setup classification task. 


3.2.2 Label Collection 

We used the StyleX dataset [6] for the liveliness prediction task. 
Although the liveliness labels were available along with the videos, 
video setup labels were not available. We therefore collected these
additional labels using Amazon Mechanical Turk. We asked the MTurk
workers to look at the 10-second clips from StyleX and choose one of
the five labels defined above. Each video clip was shown to three MTurk
labellers, and we assigned the label on which at least two of the three
labellers agreed. Although in most of the clips all frames belong
to only one of the above five categories, there were some 10-second
clips (around 5%) where frames from more than one category
were present. In those cases, labellers were asked to provide the
label of the majority of frames.
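
The two-out-of-three agreement rule can be sketched as below. The function name is hypothetical, and returning `None` when no two labellers agree is an assumption: the text does not say how such clips were handled.

```python
from collections import Counter


def aggregate_setup_label(votes, min_agreement=2):
    """Return the label chosen by at least `min_agreement` labellers, else None."""
    label, count = Counter(votes).most_common(1)[0]
    # Assumption: clips without sufficient agreement are left unlabelled (None).
    return label if count >= min_agreement else None
```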


3.2.3 CNN for Label Classification

We use a CNN architecture to classify the setup of a 10-second
clip. During the training phase, all the frames belonging to a 10-second
clip are used as samples for the corresponding clip category. For
this task, we use the same CNN architecture as in [17]. In [17],
the authors proposed a novel neural network model called AlexNet,
which improved the state-of-the-art ImageNet classification [27]
accuracy by a significant margin. Researchers in the computer vision
community have often used the AlexNet architecture for other kinds
of computer vision applications [28, 29]. Deep neural networks
usually have millions of parameters. If the available training data
for a particular classification task is not large enough, then training
a deep neural network from scratch might lead to overfitting.
Hence, it is common practice to use a CNN that is already pre-trained
for a related task and fine-tune only the top few layers of
the network for the actual classification task.


We fine-tune the final three fully connected layers (fc6, fc7, fc8)
of AlexNet for visual setup classification. First, we remove the
1000-node final layer fc8 (used to classify the 1000 classes from
ImageNet [17]) from the network and add a layer with only five nodes,
because our objective is to classify each frame into one of the
five setup categories. Since the weights of this layer are learned
from scratch, we begin with a higher learning rate of 0.01 (the same as
AlexNet). We also fine-tune the previous two fully connected layers
(fc6 and fc7). However, their weights are not learned from scratch:
we use a learning rate of 0.001 for these layers while performing
gradient descent with the setup classification training data.
Once AlexNet has been fine-tuned, a new frame can be forward-propagated
through this network to find its classification label. For
a test 10-second clip, we determine the setup label for each frame
individually and assign the majority label to the full clip. We refer
to this CNN as Setup-CNN.


3.3 Visual Feature Extraction

In this section, we describe the details of the visual features used
for predicting the liveliness of a video clip. The visual modality is
used to capture the movement of the lecturer. We use a state-of-the-art
deep CNN architecture to represent the visual information
in the form of motion across the frames. Unlike the CNN model
used in Section 3.2.3 (where the input to the model was an RGB image
comprising 3 channels), the input to the CNN model in this
section is formed by stacking horizontal and vertical optical flow
images from 10 consecutive frames of a video clip. We refer to
this CNN model as Motion-CNN in the subsequent sections of the
paper.

Figure 2: Example frames from different visual setup categories.
We also point out the modalities which are used for liveliness in
each of these setups.

For the Motion-CNN, we fine-tune the VGG-16 temporal-net trained
on the UCF-101 [30] action dataset. The final fully connected layers
(fc6, fc7, and fc8) of VGG-16 are fine-tuned with respect to the
liveliness labels of the videos. The activations of the fc7 layer are
extracted as the visual representation of the stacked optical flows
which were provided as the input to the model. Given a 10-second
clip, we generate a feature representation $v_t$ (Section 3.1) from the
corresponding 10-frame optical flow stack. We provide $v_t$ as an
input to the LSTM module at time $t$ to create a single visual
representation for the full 10-second clip (Section 5.2).


Implementation Details: We use the GPU implementation of the TVL1
optical flow algorithm [31]. We stack the optical flows in a 10-frame
window of a video clip to obtain a 20-channel optical flow
image (one horizontal channel and one vertical channel
for each frame pair) as the input to the Motion-CNN model. In the
Motion-CNN model, we also change the number of neurons in the fc7 layer
from 4096 to 512 before fine-tuning, to get a lower-dimensional
representation of the 10-frame optical flow stack. We
adopt a dropout ratio of 0.8 and set the initial learning rate to 0.001
for fc6, and to 0.01 for the fc7 and fc8 layers. The learning rate is
reduced by a factor of 10 after every 3000 iterations.
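
The channel count and the step-wise learning-rate schedule stated above amount to simple arithmetic, sketched here. The helper names are hypothetical; only the constants (10-frame window, two flow components, factor of 10 every 3000 iterations) come from the text.

```python
def stacked_flow_channels(window_frames=10, components=2):
    """Channels in the stacked input: horizontal + vertical flow per frame pair."""
    return window_frames * components


def step_lr(base_lr, iteration, step=3000, gamma=0.1):
    """Learning rate after `iteration` steps, divided by 10 every `step` iterations."""
    return base_lr * gamma ** (iteration // step)
```

Under this schedule the fc7/fc8 rate of 0.01 drops to 0.001 at iteration 3000 and to 0.0001 at iteration 6000, while fc6 starts one order of magnitude lower.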


3.4 Audio Feature Extraction 

We extract the audio feature $a_t$ (Section 3.1) using a convolutional
neural network. For each $t$, we find a corresponding one-second-long
audio signal from the 10-second clip. We apply the Short-Time
Fourier Transform to convert each one-second 1-D audio
signal into a 2-D image (namely a log-compressed mel-spectrogram
with 128 components), with the horizontal and vertical axes being
the time scale and frequency scale respectively. The CNN features
are extracted from these spectrogram images and used as inputs to
the LSTM. We fine-tune the final three layers of AlexNet [17] to
learn the spectrogram CNN features. We change the number of
nodes in fc7 to 512 and use the fc7 representation corresponding
to each spectrogram image as input to the LSTMs. The fine-tuned
AlexNet for spectrogram feature extraction is referred to as Audio-CNN.
The learning rate and dropout parameters are the same as those
given in Section 3.3.


3.5 Long Short Term Memory Networks 

The Motion-CNN (Section 3.3) and the Audio-CNN (Section 3.4)
model only the short-term local motion and audio patterns in the
video respectively. We further employ LSTMs to capture long-term
temporal patterns/dependencies in the video. LSTMs map
arbitrary-length sequential input data to output labels
using multiple hidden units. Each unit has a built-in memory
cell, which controls the in-flow, out-flow and accumulation of
information over time with the help of several non-linear gate units.
We provide a detailed description of LSTM networks below.


RNNs [8] are a special class of artificial neural networks, where
cyclic connections are also allowed. These connections allow the
networks to maintain a memory of the previous inputs, making
them suitable for modeling sequential data. Given an input
sequence $x$ of length $T$, the fixed-length hidden state or memory of
an RNN $h$ is given by

$$h_t = g(x_t, h_{t-1}), \quad t = 1, \ldots, T \tag{1}$$

We use $h_0 = 0$ in this work. Multiple such hidden layers can be
stacked on top of each other, with $x_t$ in equation 1 replaced with
the activation at time $t$ of the previous hidden layer, to obtain a
‘deep’ recurrent neural network. The output of the RNN at time $t$
is computed using the state of the last hidden layer at $t$ as

$$y_t = \theta(W_{yh} h_t^n + b_y) \tag{2}$$

where $\theta$ is a non-linear operation such as sigmoid or hyperbolic
tangent for binary classification or softmax for multiclass
classification, $b_y$ is the bias term for the output layer and $n$ is the
number of hidden layers in the architecture. The output of the RNN
at desired time steps can then be used to compute the error, and
the network weights are updated based on the gradients computed
using Back-propagation Through Time (BPTT). In simple RNNs, the
function $g$ is computed as a linear transformation of the input and
previous hidden state, followed by an element-wise non-linearity.

$$g(x_t, h_{t-1}) = \theta(W_{hx} x_t + W_{hh} h_{t-1} + b_h) \tag{3}$$
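
The recurrence in equations 1 and 3 can be unrolled in a few lines of plain Python. This is an illustrative sketch only: the hyperbolic tangent is assumed as the non-linearity, weights are passed as nested lists, and the function names are hypothetical.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_t, h_prev, W_hx, W_hh, b_h):
    """One step of equation 3 with tanh assumed as the non-linearity."""
    a = matvec(W_hx, x_t)
    r = matvec(W_hh, h_prev)
    return [math.tanh(a_k + r_k + b_k) for a_k, r_k, b_k in zip(a, r, b_h)]


def run_rnn(xs, W_hx, W_hh, b_h):
    """Unroll equation 1 over a sequence, starting from h_0 = 0."""
    h = [0.0] * len(b_h)
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_hx, W_hh, b_h)
        states.append(h)
    return states
```

Each hidden state depends on the current input and on the previous state only, which is why gradients must flow back through every step during BPTT.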


Such simple RNNs, however, suffer from the vanishing and
exploding gradient problem [7]. To address this issue, a novel form
of recurrent neural network called the Long Short Term Memory
(LSTM) network was introduced in [7]. The key difference between
simple RNNs and LSTMs is in the computation of $g$, which
is done in the latter using a memory block. An LSTM memory
block consists of a memory cell $c$ and three multiplicative gates
which regulate the state of the cell: the forget gate $f$, the input
gate $i$ and the output gate $o$. The memory cell encodes the knowledge
of the inputs that have been observed up to that time step. The forget gate
controls whether the old information should be retained or forgotten.
The input gate regulates whether new information should be
added to the cell state, while the output gate controls which parts of
the new cell state to output. The equations for the gates and cell
updates at time $t$ are as follows:

$$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) \tag{4}$$
$$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) \tag{5}$$
$$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) \tag{6}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \phi(W_{cx} x_t + W_{ch} h_{t-1} + b_c) \tag{7}$$
$$h_t = o_t \odot c_t \tag{8}$$

where $\odot$ is the element-wise multiplication operation, $\sigma$ and $\phi$ are,
respectively, the sigmoid and hyperbolic tangent functions, and $h_t$
is the output of the memory block. Like simple RNNs, LSTM networks
can be made deep by stacking memory blocks. The output
layer of the LSTM network can then be computed using equation 2.
We refer the reader to [7] for more technical details on LSTMs.
The details of the architecture used in this work are described in
Section 5.2.
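
A single memory-block update following equations 4 to 8 can be sketched in plain Python as below. The parameter dictionary and function names are hypothetical. The output follows equation 8 as written in the text, without a tanh on the cell state; many common LSTM implementations instead compute the output as the gate times tanh of the cell state.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, W_h, b, x, h):
    """Compute W_x x + W_h h + b for list-of-rows matrices."""
    out = []
    for row_x, row_h, b_k in zip(W_x, W_h, b):
        z = sum(w * v for w, v in zip(row_x, x))
        z += sum(w * v for w, v in zip(row_h, h))
        out.append(z + b_k)
    return out


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update; `p` maps names such as "W_ix", "W_ih", "b_i" to weights."""
    i_t = [sigmoid(z) for z in affine(p["W_ix"], p["W_ih"], p["b_i"], x_t, h_prev)]  # eq. 4
    f_t = [sigmoid(z) for z in affine(p["W_fx"], p["W_fh"], p["b_f"], x_t, h_prev)]  # eq. 5
    o_t = [sigmoid(z) for z in affine(p["W_ox"], p["W_oh"], p["b_o"], x_t, h_prev)]  # eq. 6
    g_t = [math.tanh(z) for z in affine(p["W_cx"], p["W_ch"], p["b_c"], x_t, h_prev)]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]  # eq. 7
    h_t = [o * c for o, c in zip(o_t, c_t)]  # eq. 8 as written (no tanh on c_t)
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate term vanishes, so the cell state is simply halved at each step, which makes the role of the forget gate easy to see.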


3.6 Multimodal LSTM for Liveliness Classification

In the proposed approach, LSTMs are used to learn the discrimi- 
native visual and audio feature representations for liveliness. The 
estimates from audio and visual LSTMs are combined to estimate 
the overall liveliness of videos. For the setup categories ‘Person
Walking/Standing’, ‘Person Sitting’ and ‘Content & Person’, both
modalities are used for liveliness prediction. For the remaining
videos from the ‘Content’ and ‘Miscellaneous’ categories, only the
audio LSTM representation is used to determine the liveliness label.
The details of the proposed approach are described below:


• Visual-LSTM: A multi-layer LSTM network is trained to learn
the discriminative visual features for liveliness. The number of
layers and the number of nodes in each layer of the LSTM network
are determined based on a validation dataset. The input
to the network at each time step $t$ is a 512-dimensional visual
feature extracted as described in Section 3.3.

• Audio-LSTM: The approach for training an audio LSTM is similar
to that for training the visual LSTM. The only difference is
that the visual features are replaced by the audio features
described in Section 3.4.

• Multimodal-LSTM: Once we have learned the discriminative audio and
visual LSTMs, the next step is to combine their predictions to
determine the final liveliness label. The visual and audio features from
each 10-second clip are forward-propagated through the
Visual-LSTM and Audio-LSTM respectively. Once the features
corresponding to all the time steps of a clip have been
forward-propagated, the liveliness predictions from both LSTM
networks are obtained. If the setup corresponding to a clip
requires combining audio and visual information, we
assign the clip a positive liveliness label if either the
Visual-LSTM or the Audio-LSTM predicts the label of the clip as
positive. For clips whose setup calls for audio only, the Audio-LSTM
label is used as the final label for the 10-second clip.
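
The setup-dependent fusion rule above reduces to a logical OR for the three lecturer-visible setups and to the audio prediction otherwise. A minimal sketch follows; the constant and function names are hypothetical, and the setup strings mirror the category names of Section 3.2.1.

```python
AUDIO_VISUAL_SETUPS = {"Person Walking/Standing", "Person Sitting", "Content & Person"}
AUDIO_ONLY_SETUPS = {"Content", "Miscellaneous"}


def fuse_clip_prediction(setup, audio_lively, visual_lively):
    """Combine per-clip LSTM predictions according to the predicted visual setup."""
    if setup in AUDIO_VISUAL_SETUPS:
        # Positive if either modality predicts "lively".
        return audio_lively or visual_lively
    if setup in AUDIO_ONLY_SETUPS:
        # The visual prediction is ignored for these setups.
        return audio_lively
    raise ValueError("unknown setup category: %r" % (setup,))
```

Because the fusion is an OR, adding the visual stream can only turn negative audio predictions into positive ones; it never overrides a positive audio prediction.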


The proposed multimodal pipeline for liveliness prediction is called
LIVELINET and is referred to by that name in the rest of the paper.


4. BASELINE DETAILS 


In this section, we describe several baselines which do not use any 
deep neural network for feature extraction or classification. How- 
ever, these methods have demonstrated state-of-the-art accuracy in 
many video/audio classification applications. We wanted to evaluate
how well these “shallow” methods perform on the liveliness
prediction task.


4.1 Visual Baseline 

The visual baseline consists of training an SVM classifier on
state-of-the-art trajectory features aggregated into local descriptors.
Improved Dense Trajectories (IDT) [32] have been shown to achieve
state-of-the-art results on a variety of action recognition benchmark
datasets. Feature points on the video frames are densely
sampled and tracked across subsequent frames to obtain dense
trajectories. Once the IDTs are computed, VLAD (Vector of Locally
Aggregated Descriptors) encoding [33] is used to obtain a compact
representation of the video. We set the number of clusters for
VLAD encoding to 30 and obtain an 11880-dimensional representation
for each video. An SVM classifier with an RBF kernel is used for
the classification. We compare this visual baseline against the
proposed approach.


4.2 Audio Baselines 

We compare LIVELINET with two different audio baselines: the
first uses bag of audio words and the second uses Hidden
Markov Models (HMMs). The audio features are computed
at a frame step of 10 ms using the open-source
audio feature extraction software OpenSMILE [34]. Motivated
by the findings in [35] and [36], where the authors show superior
performance on various paralinguistic challenges, our frame-level
features consist of (a) loudness, defined as normalized intensity
raised to a power of 0.3, (b) 12 Mel Frequency Cepstral Coefficients
(MFCCs) along with the log energy ($MFCC_0$) and their
first- and second-order delta values to capture the spectral variation,
and (c) voicing-related features such as the fundamental frequency
($F_0$), voicing probability, harmonics-to-noise ratio and zero
crossing rate. (Intensity and fundamental frequency features were
also found to be beneficial for liveliness classification in [4].)
The authors in [36] refer to these frame-level features as Low Level
Descriptors (LLDs) and provide a set of 21 functionals based on
quartiles and percentiles to generate chunk-level features. We use all
of these LLDs and the functionals for the audio feature extraction.
For every one-second audio signal (obtained using the same method
as described in Section 3.4), these frame-level features are concatenated
to form a (44 × 100 = 4400)-dimensional feature vector. The
dimensionality of the chunk-level audio feature is further reduced
to 400 by performing PCA across all the chunks in the training
data.
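
The stated dimensions can be checked with a little arithmetic. The breakdown below is one reading that is consistent with the numbers in the text, not something the text spells out: it assumes that only the 13 cepstral terms carry delta and delta-delta values, and that loudness and the four listed voicing features are used without deltas. The function names are hypothetical.

```python
def lld_count(n_mfcc=12, n_voicing=4):
    """Frame-level descriptor count under the assumed breakdown."""
    cepstral = (n_mfcc + 1) * 3  # MFCCs + log energy, each with delta and delta-delta
    loudness = 1
    return loudness + cepstral + n_voicing


def frames_per_chunk(chunk_s=1.0, hop_ms=10):
    """Number of frames in one chunk at the given frame step."""
    return int(round(chunk_s * 1000 / hop_ms))
```

Under these assumptions, 1 + 39 + 4 gives the 44 descriptors per frame, and 100 frames per one-second chunk give the 4400-dimensional vector that PCA then reduces to 400 dimensions.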


The audio features from all the one-second audio signals in the
training videos are clustered into 256 clusters. The nearest
cluster centre is found for each of these audio features. We then
create a 256-dimensional histogram for each clip based on these
nearest-neighbour assignments. This approach, known as the
bag-of-words model, is popular in computer vision and natural language
processing, and is beginning to be extended to the audio domain
in the form of bag-of-audio-words (BoAW) (e.g., [37]). An SVM
classifier with an RBF kernel is trained on this BoAW representation.
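
The histogram construction can be sketched as follows, assuming the cluster centres have already been learned (the clustering algorithm is not named in the text). The function names are hypothetical, and the optional normalisation is an assumption rather than a stated step.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centre(feature, centres):
    """Index of the cluster centre closest to `feature`."""
    return min(range(len(centres)), key=lambda k: squared_distance(feature, centres[k]))


def boaw_histogram(chunk_features, centres, normalise=False):
    """Count how many chunk features of a clip fall into each cluster."""
    hist = [0.0] * len(centres)
    for feature in chunk_features:
        hist[nearest_centre(feature, centres)] += 1.0
    if normalise and chunk_features:
        # Assumption: dividing by the chunk count is optional and not in the source.
        total = float(len(chunk_features))
        hist = [h / total for h in hist]
    return hist
```

With 256 centres, each clip is summarised by a 256-bin count vector regardless of how many one-second chunks it contains.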


As a second baseline, two 3-state HMMs, one each for the posi- 
tive and the negative class, are trained using the sequence of audio 
features computed on these one second audio signals. Only left-to- 
right state transitions are permitted with a potential skip from the 
first state to the third state. Each state is modeled as a 16-mixture
Gaussian Mixture Model. The 44 frame-level LLDs are the inputs
to the HMM framework. The Scilearn implementation of HMM is
used.
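
The left-to-right topology with a single skip can be written down as a mask over the transition matrix. This sketch is illustrative: it assumes self-transitions are allowed (as is usual for left-to-right HMMs, though the text does not say so), and the uniform initialisation over allowed transitions is a hypothetical starting point, not the trained model.

```python
def left_to_right_mask(n_states=3, allow_skip=True):
    """1 where a transition is permitted, 0 where it is forbidden."""
    mask = [[0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        mask[i][i] = 1  # assumption: a state may repeat
        if i + 1 < n_states:
            mask[i][i + 1] = 1  # move to the next state
    if allow_skip and n_states >= 3:
        mask[0][2] = 1  # skip from the first state to the third
    return mask


def uniform_transitions(mask):
    """Row-normalise the mask into an initial transition matrix."""
    return [[m / sum(row) for m in row] for row in mask]
```

Entries that start at zero remain zero under the usual re-estimation updates, so the topology is preserved during training.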


4.3 Multimodal Baseline


For combining the audio and video modalities we employ a clas- 
sifier stacking approach. Stacking involves learning an algorithm 
to combine the predictions of other classifiers. We first train two 
SVM classifiers on audio and video features separately. The fea- 
tures and kernels used here are the same as the individual audio 
and visual baselines described earlier. Subsequently, another SVM 
classifier (with RBF kernel) is trained on the predictions of the au- 
dio and video classifiers to make the final prediction. We compare 
this baseline against the proposed multimodal classifier. 


5. EXPERIMENTAL RESULTS 

In this section, we provide the details of the experimental results. 
First, we describe the StyleX dataset followed by the details of the 
proposed LSTM network architecture and setup classification re- 
sults. Next, we provide the liveliness classification results using 
the proposed multimodal deep neural network method. Finally, we 
perform some preliminary quality analysis of the lively/not-lively 
videos. 


5.1 Dataset 


We use the StyleX dataset proposed in [6] for our experiments. 
StyleX comprises of 450 one-minute video snippets featuring 50 
different instructors, 10 major topics in engineering and various 
accents of spoken English. Each video was annotated by multi- 
ple annotators for liveliness. The scores from all annotators (in 
the range 0 — 100, where 0 implies least lively and 100 implies 
most lively) corresponding to a particular video were averaged to 
obtain the mean liveliness score. The bimodal distribution of the 
mean liveliness scores were analyzed to estimate the threshold for 
binary label assignment (lively and not-lively). All videos with 
liveliness score above the threshold were assigned to the positive 
class whereas the remaining videos were assigned to the negative 
class. At a threshold of 54, we have 52% videos in the nega- 
tive class (Thus, a simple majority-class classifier would lead to 
52% classification accuracy). Out of the 450 StyleX videos, we 
randomly choose 60% for training, 20% for validation and 20% 
for testing while ensuring a proportional representation of both the 
classes in each subset. Since the proposed method takes 10-second 
clips as input during training and testing, we further split each one- 
minute video into 10-second clips bookended by silence, with a 
50% overlap across adjacent clips. Each of these 10-second clips 
is assigned the same label as its source one-minute video and is 
treated as an independent training instance. Likewise, during testing, the 
10-second clips are extracted from one-minute videos. The label is 
predicted for each 10-second clip and the label of the one-minute 
video is determined based on the majority vote. 
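
The labeling, clip extraction, and voting steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the windows here are fixed-length and ignore the silence-based boundaries mentioned in the text, so the number of clips per video need not match the one used in the experiments.

```python
from collections import Counter


def binary_label(mean_score, threshold=54):
    """Map a mean liveliness score (0 to 100) to 1 (lively) or 0 (not-lively).

    Scores strictly above the threshold go to the positive class.
    """
    return 1 if mean_score > threshold else 0


def clip_windows(duration_s=60.0, clip_s=10.0, overlap=0.5):
    """Return (start, end) times of fixed-length clips with fractional overlap.

    Assumption: purely time-based windows; silence detection is not modeled.
    """
    step = clip_s * (1.0 - overlap)
    windows = []
    start = 0.0
    while start + clip_s <= duration_s:
        windows.append((start, start + clip_s))
        start += step
    return windows


def video_label(clip_labels):
    """Majority vote over per-clip predictions for one video.

    Ties are resolved in favor of the label seen first.
    """
    return Counter(clip_labels).most_common(1)[0][0]
```

During training, every clip returned by `clip_windows` would inherit `binary_label` of its source video; at test time, `video_label` aggregates the per-clip predictions back into a single video-level decision.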


5.2 LSTM Architecture Details 

Figure 3: A comparison of results obtained from our proposed 
Multimodal-LSTM (LIVELINET) approach and the baselines. 


The parameters of the proposed visual-LSTM and audio-LSTM 
were selected using the validation set. The learning rate was ini- 
tialized to 10~* and decayed after every epoch. Dropout rate of 
0.2 was used for the activations of the last hidden layer. We tried 
nine different combinations for the number of hidden layers (1, 2, 
3) and number of units in each layer (128, 256, 512), for both visual 
and audio modalities. Visual-LSTM with 2 layers and 256 hidden 
units and audio-LSTM with 2 layers and 256 hidden units led to the 
optimal performance on the validation set. 
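
The nine-configuration search described above amounts to a small grid search scored on the validation set. The sketch below shows the selection logic only; `validation_accuracy` is a hypothetical caller-supplied function (not from the paper) that would train an LSTM with the given configuration and return its validation accuracy.

```python
from itertools import product

LAYER_CHOICES = (1, 2, 3)          # number of hidden layers tried
UNIT_CHOICES = (128, 256, 512)     # number of units per layer tried


def candidate_configs():
    """Enumerate all (layers, units) combinations in the search grid."""
    return list(product(LAYER_CHOICES, UNIT_CHOICES))


def select_config(validation_accuracy):
    """Return the (layers, units) pair with the highest validation accuracy.

    `validation_accuracy(layers, units)` is an assumed callback that
    trains a model and evaluates it on the validation set.
    """
    return max(candidate_configs(), key=lambda cfg: validation_accuracy(*cfg))
```

The same routine would be run once per modality, since the visual-LSTM and audio-LSTM are tuned separately.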


5.3 Setup Classification 

In this section, we report the visual setup classification results ob- 
tained using the framework proposed in Section 3.2. As discussed 
in Section 5.1, the number of video clips used is 2700 for the train- 
ing phase and 900 each for the validation and testing phases (all clips 
are approximately 10 seconds long). The network is trained with 
all the frames (~ 300K) extracted from the training video clips. At 
the time of testing, a label is predicted for each frame in a 10- 
second clip, and the majority label is taken as the label of the full clip. 
We evaluate 5-way classification accuracy of the video clips into 
different visual setups. Our proposed CNN architecture achieves a 
classification accuracy of 86.08% for this task. However, we notice 
that for the task of liveliness prediction, we only require the classi- 
fication of video clips into two different classes - (a) clips requiring 
only audio modality, and (b) clips requiring both audio and video 
modalities for liveliness prediction. For this binary classifi- 
cation task (‘Content or Miscellaneous’ vs. ‘Person Walking/Standing 
or Person Sitting or Content & Person’), our system achieves an 
accuracy of 93.74%. Based on the visual setup label of a clip, we 
use either both audio and visual modalities or only the audio modality for liveliness 
prediction. 
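
The frame-to-clip voting and the modality selection described above can be sketched as follows. This is an illustrative sketch under assumptions: the function names are hypothetical, and the mapping of ‘Content’ and ‘Miscellaneous’ to the audio-only case is inferred from the order in which the two groups are listed in the text.

```python
from collections import Counter

# Assumed grouping of the five setup labels into the two modality cases.
AUDIO_ONLY_SETUPS = {"Content", "Miscellaneous"}
AUDIO_VISUAL_SETUPS = {"Person Walking/Standing", "Person Sitting", "Content & Person"}


def clip_setup(frame_labels):
    """Majority vote over per-frame setup predictions for one clip."""
    return Counter(frame_labels).most_common(1)[0][0]


def modalities_for(setup):
    """Return the modalities to use for liveliness prediction on a clip."""
    if setup in AUDIO_ONLY_SETUPS:
        return ("audio",)
    if setup in AUDIO_VISUAL_SETUPS:
        return ("audio", "visual")
    raise ValueError(f"unknown setup label: {setup!r}")
```

For example, a clip whose frames are mostly labeled ‘Person Sitting’ would be routed to the multimodal model, while a slides-only clip would be scored from audio alone.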


5.4 Liveliness Classification 

In this section, we present the performance of the proposed multimodal 
deep neural network for liveliness prediction. Figure 3 depicts the 
results of our experiments. We obtain an accuracy of 70.6% with 
the Visual-LSTM, an absolute improvement of 6.2% over the vi- 
sual baseline. The two audio baselines of HMM and BoAW meth- 
ods lead to an accuracy of 60% and 63.3%, respectively. The 
Audio-LSTM setup leads to 75.0% accuracy, an increase of 11.7% 
over the best audio baseline. The proposed Multimodal-LSTM 
method (LIVELINET) achieves an accuracy of 76.5% compared 
to 71.1% obtained using the audio-visual baseline, an absolute im- 
provement of 5.4% (relative improvement of 7.6%). We are also 
relatively 1.9% better than using only the audio-LSTM. The boost 
in accuracy when using both modalities indicates that the infor- 
mation available from the audio and visual modalities is complemen- 
tary and that the proposed approach exploits it effectively. 
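
The relation between the absolute and relative improvements quoted above can be checked with a short calculation. The helper name is hypothetical, and the inputs are the rounded accuracies reported in the text. From these rounded values the gain over the audio-LSTM comes out at about 2.0%, and the small gap to the reported 1.9% may come from rounding of the published accuracies.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Multimodal-LSTM (76.5%) versus the audio-visual baseline (71.1%).
gain_over_baseline = relative_improvement(76.5, 71.1)

# Multimodal-LSTM (76.5%) versus the audio-LSTM alone (75.0%).
gain_over_audio_lstm = relative_improvement(76.5, 75.0)

print(round(gain_over_baseline, 1), round(gain_over_audio_lstm, 1))
```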


5.5 Qualitative Analysis 

We also perform qualitative analysis of the videos that are predicted 
lively/not-lively by LIVELINET. Our goal is to determine the gen- 
eral visual and audio patterns that make a video lively. This is 
a preliminary analysis of exemplar lively and exemplar non-lively 
lectures. We continue to perform a more systematic and in-depth 
qualitative analysis to understand two aspects: (a) patterns that the 
proposed classifier identifies as representative of lively and of not- 
lively, and (b) general audio-visual patterns that may have influ- 
enced the human labelers in assigning the ‘lively or non-lively’ 
label. One of the current directions for extending this work is 
to understand pedagogically-proven best practices of teaching and 
codify that knowledge in the form of features to be extracted and 
fed to the classifier. Some example frames from lively and not- 
lively videos as predicted by LIVELINET are shown in Figure 4. 
Some of our initial findings are: (a) Lecturers who alternate between 
making eye contact with the audience and looking at the content 
are perceived as more lively. (b) Similarly, voice modulation, 
moving around in the classroom (as opposed to sitting in place), 
and specific visual references (like pointing to written content) 
synchronized with the spoken content seem to positively influence 
perceived liveliness. 


6. CONCLUSION 

We propose a novel method called LIVELINET that combines vi- 
sual and audio information in a deep learning framework to predict 
liveliness in an educational video. First, we use a CNN architec- 
ture to determine the overall visual style of an educational video. 
Next, audio and visual LSTM deep neural networks are combined 
to estimate if a video is lively or not-lively. We performed experi- 
ments on the StyleX dataset and demonstrated significant improve- 
ment compared to the state-of-the-art methods. Future directions 
include incorporating text-based features for content-based live- 
liness scoring. We also note that LIVELINET is going to be part of 
our e-learning platform TutorSpace. 


Figure 4: Some example frames from videos predicted as lively and not-lively by our proposed method LIVELINET. The setup labels 
predicted by the proposed Setup-CNN approach are also shown. 